A Pattern Matching method for finding Noun and Proper Noun Translations from Noisy Parallel Corpora
We present a pattern matching method for compiling a bilingual lexicon of
nouns and proper nouns from unaligned, noisy parallel texts of
Asian/Indo-European language pairs. Tagging information of one language is
used. Word frequency and position information for high and low frequency words
are represented in two different vector forms for pattern matching. New anchor
point finding and noise elimination techniques are introduced. We obtained a
73.1% precision. We also show how the results can be used in the compilation
of domain-specific noun phrases.
Comment: 8 pages, uuencoded compressed postscript file. To appear in the
Proceedings of the 33rd AC
K-vec: A New Approach for Aligning Parallel Texts
Various methods have been proposed for aligning texts in two or more
languages such as the Canadian Parliamentary Debates (Hansards). Some of these
methods generate a bilingual lexicon as a by-product. We present an alternative
alignment strategy, which we call K-vec, that starts by estimating the lexicon.
For example, it discovers that the English word "fisheries" is similar to the
French "pêches" by noting that the distribution of "fisheries" in the English
text is similar to the distribution of "pêches" in the French. K-vec does not
depend on sentence boundaries.
Comment: 7 pages, uuencoded, compressed PostScript; Proc. COLING-9
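The distributional idea in this abstract can be illustrated with a minimal sketch: split each text into K equal pieces, record in which pieces a word occurs, and compare the resulting vectors. The function names, the toy token lists, and the use of the Dice coefficient are assumptions made for illustration; the abstract does not say which similarity measure K-vec uses.

```python
def occurrence_vector(tokens, word, k):
    """Return a binary vector: entry i is 1 if `word` occurs in the i-th of k equal pieces."""
    n = len(tokens)
    vec = [0] * k
    for pos, tok in enumerate(tokens):
        if tok == word:
            vec[pos * k // n] = 1  # map token position to its piece index
    return vec


def dice(u, v):
    """Dice coefficient between two binary vectors (a stand-in similarity measure)."""
    both = sum(a & b for a, b in zip(u, v))
    total = sum(u) + sum(v)
    return 2 * both / total if total else 0.0


# Toy "parallel" texts of different lengths; no sentence boundaries are needed.
english = ["a", "fisheries", "b", "c", "d", "e", "f", "fisheries"]
french = ["x", "x", "pêches", "y", "y", "y", "y", "y", "y", "y", "pêches", "z"]

k = 4
fisheries_vec = occurrence_vector(english, "fisheries", k)
peches_vec = occurrence_vector(french, "pêches", k)
other_vec = occurrence_vector(french, "y", k)

print(dice(fisheries_vec, peches_vec))  # similar distributions score high
print(dice(fisheries_vec, other_vec))   # dissimilar distributions score lower
```

Because the vectors are indexed by relative position in the text rather than by sentence, the comparison works even when the two texts have different lengths and no sentence alignment.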
Statistical Augmentation of a Chinese Machine-Readable Dictionary
We describe a method of using statistically-collected Chinese character
groups from a corpus to augment a Chinese dictionary. The method is
particularly useful for extracting domain-specific and regional words not
readily available in machine-readable dictionaries. Output was evaluated both
using human evaluators and against a previously available dictionary. We also
evaluated performance improvement in automatic Chinese tokenization. Results
show that our method outputs legitimate words, acronymic constructions, idioms,
names and titles, as well as technical compounds, many of which were lacking
from the original dictionary.
Comment: 17 pages, uuencoded compressed PostScrip
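As a rough illustration of collecting character groups from a corpus, the sketch below counts character n-grams and keeps the frequent ones that are missing from a dictionary. The function name, the thresholds, and the toy data are hypothetical, and raw frequency counting is a simplification; the abstract does not describe the statistics the authors actually used.

```python
from collections import Counter


def candidate_groups(text, dictionary, n=2, min_count=2):
    """Return frequent character n-grams in `text` that are absent from `dictionary`."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return sorted(g for g, c in counts.items() if c >= min_count and g not in dictionary)


# Toy corpus and dictionary (illustrative only).
corpus = "香港特區香港特區香港"
known_words = {"香港"}

candidates = candidate_groups(corpus, known_words)
print(candidates)
```

Frequency alone overgenerates: groups that straddle a word boundary also pass the threshold, which is one reason candidates would still need filtering or human evaluation of the kind the abstract mentions.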
GlobalTrait: Personality Alignment of Multilingual Word Embeddings
We propose a multilingual model to recognize Big Five Personality traits from
text data in four different languages: English, Spanish, Dutch and Italian. Our
analysis shows that words having a similar semantic meaning in different
languages do not necessarily correspond to the same personality traits.
Therefore, we propose a personality alignment method, GlobalTrait, which has a
mapping for each trait from the source language to the target language
(English), such that words that correlate positively to each trait are close
together in the multilingual vector space. Using these aligned embeddings for
training, we can transfer personality-related training features from
high-resource languages such as English to other low-resource languages, and
get better multilingual results, when compared to using simple monolingual and
unaligned multilingual embeddings. We achieve an average F-score increase
(across the three non-English languages) from 65 to 73.4 (+8.4) when
comparing our monolingual model to a multilingual CNN with personality-aligned
embeddings. We also show relatively good performance in the regression
tasks, and better classification results when evaluating our model on a
separate Chinese dataset.
Comment: Submitted and accepted to AAAI 2019 conferenc